Abstract

We present a multiprocessor "stop-the-world" 
garbage collection framework that provides multi- 
ple forms of load balancing. Our parallel collectors 
use this framework to balance the work of root scan- 
ning, using static overpartitioning, and also to bal- 
ance the work of tracing the object graph, using a 
form of dynamic load balancing called work steal- 
ing. We describe two collectors written using this 
framework: pSemispaces, a parallel semispace col- 
lector, and pMarkcompact, a parallel markcompact 
collector. 



1 Introduction 

The Java™ programming language is increas- 
ingly used for large, memory-intensive, multi- 
threaded applications run on shared-memory multi- 
processors. Most Java virtual machines (JVM™'s) 
employ "stop-the-world" garbage collection (GC) al- 
gorithms that first halt the running threads and then 
perform the GC. If we have more than one processor 
available, it makes sense to employ them all in the 
GC process. This paper describes the paralleliza- 
tion of two sequential GC algorithms to allow them 
to take advantage of all available processors. 

The Java Technology Research Group at Sun
Microsystems Laboratories 1 has developed a JVM
that includes a GC interface [12] to support multiple
GC algorithms, thus enabling comparison of various
GC strategies in a high-performance virtual machine.
This paper describes our augmentation of this
interface with a parallel infrastructure to support
multiple parallel GC strategies. We use this
infrastructure to parallelize two well-known
collection schemes: a two-space copying algorithm
(semispaces) and a mark-sweep algorithm with sliding
compaction (markcompact). The resulting algorithms
have outperformed their highly tuned, product-quality
sequential counterparts on multiprocessors.

1 www.sun.com/research/jtech

In parallelizing sequential GC algorithms, one has 
to tackle two key issues: load balancing of all parts 
of the algorithm and re-engineering of any inherently 
sequential elements. 

In both algorithms, the key to load balancing is to 
correctly and efficiently partition the task of tracing 
the object graph. This task unfortunately does not 
lend itself to static partitioning. Our approach, de- 
scribed in the following sections, is to combine static 
partitioning with dynamic load balancing based on 
work stealing. We show that this combination of 
static and dynamic methods leads to effective paral- 
lelization of both the semispaces and markcompact 
collectors. We believe that the effectiveness of our
dynamic partitioning results from a finely tuned
lock-free work-stealing algorithm based on Arora et
al. [1], whose low overhead allows us to balance our
work at the individual object level.

In both algorithms there are parts that are not 
easily parallelizable. In the semispaces algorithm 
these included: installing forwarding pointers, allo- 
cating in parallel, and scanning the card table for 
references from the old generation. Though the in- 
stallation of forwarding pointers is not parallelized, 
it is performed in a lock-free manner. We make 
allocation less of a sequential bottleneck by glob- 
ally allocating local buffers from which objects may 
be allocated without synchronization. Scanning the 
card table requires a novel partitioning scheme to 
achieve good load balancing. In markcompact the 
main inherently sequential part is the compaction 
phase, which involves copying all objects to one end 
of the heap. Here we statically partition the old 
generation heap into n partitions, compacting even 
partitions in one direction and odd partitions in the 
other direction, thus avoiding synchronization and 
optimizing the size of the free areas. 

1.1 A short description of previous work 

Endo et al. [11] describe a parallel stop-the-world
GC algorithm using work stealing. Their algorithm 
depends on threads with work copying some work 
to auxiliary queues, where the work is available for 
stealing. Threads without work look for an auxil- 
iary queue with work, lock the queue, and steal half 
of the queue's elements. Our work extends theirs 
by using a lower-overhead work-stealing mechanism, 
and by addressing the harder problem of paralleliz- 
ing relocating collectors, not just a non-relocating 
mark-sweep algorithm. 

Halstead [6] describes a multiprocessor GC for 
Multilisp. Each processor has its own local heap, 
and they use lock bits for moving and updating for- 
warding pointers. Load balancing is done statically 
rather than dynamically. 

Many collectors operate concurrently with muta- 
tor activity [9, 4, 5, 8]. This kind of concurrency is 
orthogonal to the style of parallel collection we de- 
scribe in this paper. A collector might combine both: 
some concurrent collectors have stop-the-world phases
that might be performed in parallel, and collectors 
with concurrent GC threads might use several such 
threads working in parallel to cope with high aggre- 
gate garbage-creation rates in multi-threaded pro- 
grams. 

Steensgaard [10] explores a clever method for par- 
tially parallelizing collection. Compile-time analysis 
identifies allocation sites that allocate objects that 
never escape the allocating thread (are never acces- 
sible to other threads.) Such objects are allocated 
in a thread-local heap, which can be collected inde- 
pendently of other threads. This technique avoids 
the synchronization issues that general parallel col- 
lection must address, but requires extensive and ex- 
pensive static analysis, and only a subset of objects 
may be collected thread-locally. 



1.2 Overview 

Section 2 presents basic parallel programming 
techniques. Section 3 presents our parallel GC in- 
frastructure, which applies these techniques to the 
garbage collection problem. Sections 4 and 5 de- 
scribe two parallel algorithms we implemented us- 
ing this infrastructure: pSemispaces and pMarkcom- 
pact. Section 6 presents results for three bench- 
marks. Section 7 presents conclusions. 

2 Parallel Programming Basics 

If we had a predetermined amount of work to do 
and were able to partition it perfectly across all avail- 
able processors, we would achieve perfect parallelism 
and finish the collection in the least possible amount 
of time. Some tasks can be partitioned in this way; 
we call them statically partitionable. Other tasks are 
difficult to divide into subtasks of predictable size. 
For example, tracing the graph of a program's live 
data is difficult to subdivide a priori, because it de- 
pends on the shape of the object graph. Many tasks 
fall somewhere in between: we are able to partition 
them statically into roughly, but not exactly, equiva- 
lent subtasks. We overpartition such tasks. That is, 
we break the tasks into more subtasks than we have 
threads, and then each thread dynamically claims 
one subtask at a time. 

There are two motivations for overpartitioning. 
First, the number of processors available to the GC 
process is unpredictable due to load on the machine 
from other processes. If a task were divided into ex- 
actly n subtasks on an n-processor machine, and one 
of the processors were unavailable, then one proces- 
sor would have to complete two subtasks, thereby 
doubling the time for the computation. With over- 
partitioning, this extra subtask would be divided 
into several smaller subtasks that may be distributed 
across the active processors. Second, when we only 
have a rough estimate of how much work each sub- 
task represents, assigning just one task to each pro- 
cessor risks one of those tasks being significantly 
larger than the others. Overpartitioning both de- 
creases this risk by making smaller subtasks, and 
enables processors that have finished smaller sub- 
tasks to take on additional work. 
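
The claiming scheme described above can be sketched as follows. This is a minimal illustration, not the collector's code: the function name `run_overpartitioned`, the overpartitioning factor of 4, and the use of a lock-protected counter for claiming are all assumptions, and summing integers stands in for real GC work.

```python
import threading

def run_overpartitioned(items, num_threads, factor=4):
    """Split items into about num_threads * factor chunks; threads claim chunks dynamically."""
    num_chunks = max(1, num_threads * factor)
    size = -(-len(items) // num_chunks)          # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)] if size else []
    next_chunk = 0
    lock = threading.Lock()
    totals = [0] * num_threads                   # per-thread partial results
    claimed = [0] * num_threads                  # number of chunks each thread claimed

    def worker(tid):
        nonlocal next_chunk
        while True:
            with lock:                           # claim one subtask at a time
                if next_chunk >= len(chunks):
                    return
                idx = next_chunk
                next_chunk += 1
            totals[tid] += sum(chunks[idx])      # stand-in for the real work
            claimed[tid] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(totals), claimed
```

A thread that is delayed simply claims fewer chunks; the remaining chunks are picked up by whichever threads are still running.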

Some tasks are not even approximately statically
partitionable. These tasks require some form of dynamic
load balancing. Work stealing [2] is a highly
effective load balancing technique in such situations.
In this approach, each thread works on its own tasks 
until it runs out of work, and then takes the initia- 
tive to steal work from one of the other processors. 



2.1 A short explanation of lock-free
work-stealing queues

Arora et al. [1] present a non-blocking implementation
of a double-ended queue data structure tailored
to support work stealing with minimal synchronization.
Each thread has its own work queue of tasks.
There are three fundamental operations: PushBottom
pushes an element onto the bottom of the queue,
PopBottom pops an element from the bottom of the
queue, and PopTop pops an element from the top
of the queue. PushBottom and PopBottom are local
operations that usually require no synchronization.
PopTop is used for stealing from other threads'
queues.

A parallel algorithm using work stealing starts 
with available tasks distributed among the work 
queues. Each thread uses PopBottom to claim tasks 
from its local queue. Execution of this task may re- 
veal new subtasks, which are then added to the local 
queue using PushBottom. When a thread runs out 
of work it uses PopTop to steal a task from some 
other thread's work queue. Synchronization is re- 
quired only when stealing an element from another 
queue or when claiming the last element from the 
local queue. 
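
The interface of such a queue can be modelled as below. This is a deliberately simplified sketch: it protects every operation with a lock, whereas the algorithm of Arora et al. is lock-free and synchronizes only on steals and on the last element. The names `WorkQueue` and `find_work` are hypothetical.

```python
import threading
from collections import deque

class WorkQueue:
    """Lock-based stand-in for a work-stealing deque (left end = top, right end = bottom)."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()

    def push_bottom(self, task):
        with self._lock:
            self._items.append(task)

    def pop_bottom(self):
        """Used by the owning thread; returns None when the queue is empty."""
        with self._lock:
            return self._items.pop() if self._items else None

    def pop_top(self):
        """Used by thieves; returns None when the queue is empty."""
        with self._lock:
            return self._items.popleft() if self._items else None

def find_work(own, others):
    """Try the local queue first, then try to steal from each other queue in turn."""
    task = own.pop_bottom()
    if task is not None:
        return task
    for victim in others:
        task = victim.pop_top()
        if task is not None:
            return task
    return None
```

The owner works at the bottom and thieves take from the top, so the two ends only meet when the queue is nearly empty.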

We modified the algorithm of Arora et al. in sev- 
eral ways. We added a termination detection pro- 
tocol to ensure that all work is complete before any 
thread terminates. We also added support for fixed 
size queues in the form of an overflow detection and 
handling mechanism. 



3 Parallel GC Infrastructure 

3.1 Balancing root scanning 

Garbage collection computes the transitive clo- 
sure of objects reachable from a set of root pointers. 
In our JVM, the root set consists of class statics, 
thread stacks, etc. We overpartition these roots into 
groups, and the GC threads compete dynamically 
to claim root groups. Even if the static partition- 
ing succeeds in balancing root scanning, starting off 
with balanced groups is not sufficient. Some roots 
may lead to large data structures, while others may 
lead to single objects. 

3.2 Balancing traversal of live data 

We solve this problem by using work stealing to 
dynamically balance the load. The tasks are refer- 
ences to objects to be scanned, i.e., examined for 
pointers to other objects. 2 A scanning GC thread 
acquires an object reference either from its local 
queue or by stealing from another thread's queue, 
and pushes any outgoing references found in the ob- 
ject onto its local queue. The termination detection 
protocol is used to determine the completion of the 
transitive closure. 

Consider the behaviour of this algorithm on a 
large linked data structure, say a binary tree. One 
thread will scan a root pointer referencing the top- 
level node of the tree, push both child nodes onto 
its work queue, and then pop one of the child nodes 
for processing. The other child node is now available 
for stealing. In this way, for a sufficiently large tree, 
the load will be dynamically balanced. 
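
The binary-tree example can be simulated with a small round-robin model. This is a sketch under stated assumptions rather than a faithful model of the collector: workers take turns in lock step instead of running concurrently, a thief steals from the longest queue, and the tree is a complete binary tree whose nodes are numbered heap-style. The function name `trace_tree` is hypothetical.

```python
from collections import deque

def trace_tree(depth, num_workers):
    """Simulate work-stealing traversal; returns the number of nodes each worker scanned."""
    last = 2 ** depth - 1                    # nodes are numbered 1..last
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(1)                      # worker 0 finds the root pointer
    scanned = [0] * num_workers
    remaining = last
    while remaining:
        for w in range(num_workers):
            if queues[w]:
                node = queues[w].pop()       # PopBottom on the worker's own queue
            else:
                victims = [q for q in queues if q]
                if not victims:
                    continue                 # nothing to steal in this round
                node = max(victims, key=len).popleft()   # PopTop: steal
            scanned[w] += 1
            remaining -= 1
            for child in (2 * node, 2 * node + 1):
                if child <= last:
                    queues[w].append(child)  # PushBottom the outgoing references
    return scanned
```

Calling `trace_tree(10, 4)` returns one count per worker; the counts always sum to the number of nodes in the tree, and every worker obtains work by stealing within the first few rounds.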

3.3 Termination detection 

The termination protocol is based on a status 
word containing one bit for each thread participating 
in the GC. All threads start off marked active. As 
long as a thread has local work, gets work from the 
overflow lists (see section 3.4), or succeeds in stealing 
work, its bit in the status word remains on. Once it 
is unable to find work it sets its status bit to off and 
loops, checking to see if all the status bits are off. If 
so, then all threads have offered to terminate, so the
algorithm is complete. If not, the thread peeks at 
other threads' queues, attempting to find one with 
work to steal. If it finds a thread with work to steal, 
the thief sets its status bit to active and tries to steal 
the work. If it succeeds, it goes back to processing. 
If it fails, it sets its status bit back to inactive and 
resumes the loop. 
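
The protocol above can be sketched as follows. This is an illustration only: a lock stands in for the atomic updates of the status word, and the callbacks `peek_for_victim` and `try_steal` are hypothetical placeholders for the queue inspection and stealing steps.

```python
import threading

class StatusWord:
    """One bit per GC thread; a set bit means the thread is active."""

    def __init__(self, num_threads):
        self._word = (1 << num_threads) - 1  # all threads start off active
        self._lock = threading.Lock()        # stands in for atomic bit operations

    def set_active(self, tid):
        with self._lock:
            self._word |= 1 << tid

    def offer_termination(self, tid):
        with self._lock:
            self._word &= ~(1 << tid)

    def all_inactive(self):
        with self._lock:
            return self._word == 0

def find_work_or_terminate(tid, status, peek_for_victim, try_steal):
    """Return a stolen task, or None once every thread has offered to terminate."""
    status.offer_termination(tid)
    while not status.all_inactive():
        victim = peek_for_victim()           # look for a queue that appears to have work
        if victim is not None:
            status.set_active(tid)           # become active before attempting the steal
            task = try_steal(victim)
            if task is not None:
                return task
            status.offer_termination(tid)    # steal failed: go back to inactive
    return None
```

Setting the bit before the steal attempt matters: a thread that holds stolen work is never observed as inactive by the others.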

Our colleague Peter Kessler has suggested replac- 
ing the status word with an integer indicating the 
number of active threads. To offer termination, a 
thread would decrement this count with an atomic 
instruction; if the count goes to zero, all threads 
have terminated. When an inactive thread becomes 
active, it would increment the count, again with an 
atomic instruction. This avoids the parallelism lim- 
itation imposed by the bit-width of a word, but we 
have not yet implemented this proposal. 
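
A counter-based variant along the lines of this proposal might look like the sketch below. The class name `ActiveCount` is hypothetical, and the lock again stands in for atomic increment and decrement instructions.

```python
import threading

class ActiveCount:
    """Counts active GC threads instead of keeping one status bit per thread."""

    def __init__(self, num_threads):
        self._count = num_threads            # all threads start off active
        self._lock = threading.Lock()        # stands in for atomic instructions

    def offer_termination(self):
        """Decrement the count; returns True if no thread remains active."""
        with self._lock:
            self._count -= 1
            return self._count == 0

    def become_active(self):
        """Increment the count when an inactive thread finds work to steal."""
        with self._lock:
            self._count += 1
```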

3.4 Handling overflow in GC work- 
stealing queues 

In order to avoid allocation during GC we allo- 
cate fixed-size work-stealing queues at startup time 

2 For large objects, especially large arrays of references, it 
might be advantageous to consider the object as comprised of 
several chunks, and subdivide the object-scanning task into 
the separate tasks of scanning each chunk. We have not im- 
plemented this extension. 
Figure 1: Overflow sets

and use them for all GCs. This required modifications
to the work-stealing code to check for overflow,
and a mechanism for handling overflow gracefully
by offloading some items to a global overflow
set. Threads without work look to the overflow set
for work before resorting to stealing. We wished to
handle overflow without any additional storage
space and also to avoid "thrashing" of objects
between the overflow set and the work-stealing
queues.

We modified PushBottom to check for possible 
overflow before adding an element. If adding an ele- 
ment would cause the queue to overflow, we pop all 
elements in the bottom half of the queue and add 
them to the overflow set. 

The overflow set mechanism, due to our colleague 
Ole Agesen, exploits a class pointer header word 
present in all objects in our implementation. As 
shown in Figure 1, for each class X, we link all in- 
stances of X in the overflow set together in a linked 
list whose head is contained in the class structure for 
X. Each object's class pointer is overwritten with the 
"next" pointer of the list. This does not destroy in- 
formation, since all objects in a given class's list are 
instances of that class. All classes with instances in 
the overflow set are linked into another list. This 
mechanism represents the overflow set with only a 
small per class storage overhead. 
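
The linking scheme can be modelled in a few lines. This sketch uses Python objects in place of raw header words, and the names `Klass`, `Obj`, and `OverflowSet` are hypothetical; it is single-threaded and ignores the synchronization a shared overflow set would need.

```python
class Klass:
    def __init__(self, name):
        self.name = name
        self.overflow_head = None    # first overflowed instance of this class
        self.next_klass = None       # link in the list of classes with overflowed instances

class Obj:
    def __init__(self, klass):
        self.header = klass          # header word: normally the class pointer

class OverflowSet:
    def __init__(self):
        self.klasses = None          # head of the list of classes with instances in the set

    def add(self, obj):
        klass = obj.header
        if klass.overflow_head is None:      # first instance: link the class in
            klass.next_klass = self.klasses
            self.klasses = klass
        obj.header = klass.overflow_head     # overwrite the class pointer with "next"
        klass.overflow_head = obj

    def remove(self):
        klass = self.klasses
        if klass is None:
            return None                      # the overflow set is empty
        obj = klass.overflow_head
        klass.overflow_head = obj.header     # advance to the next instance
        obj.header = klass                   # restore the class pointer
        if klass.overflow_head is None:      # no instances left: unlink the class
            self.klasses = klass.next_klass
            klass.next_klass = None
        return obj
```

The class pointer can be restored on removal because the list an object sits on identifies its class.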

By draining only the bottom half of a queue on 
overflow and filling no more than the top half of the 
queue when retrieving work from the overflow set, we 
ensure that no object will be placed in the overflow 
set more than once, thus avoiding "thrashing." 
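
The half-drain and half-fill policy can be sketched as follows. The class name `BoundedWorkQueue` is hypothetical, and a plain Python list stands in for the global overflow set; the sketch is single-threaded.

```python
from collections import deque

class BoundedWorkQueue:
    """Fixed-capacity queue (left end = top, right end = bottom) with overflow handling."""

    def __init__(self, capacity, overflow_set):
        self.capacity = capacity
        self.items = deque()
        self.overflow_set = overflow_set     # plain list standing in for the global set

    def push_bottom(self, obj):
        if len(self.items) == self.capacity:
            # The queue is full: move the bottom half to the overflow set first.
            for _ in range(self.capacity // 2):
                self.overflow_set.append(self.items.pop())
        self.items.append(obj)

    def refill_from_overflow(self):
        """Called by a thread whose queue is empty; fills at most half of the queue."""
        n = min(self.capacity // 2, len(self.overflow_set))
        for _ in range(n):
            self.items.append(self.overflow_set.pop())
        return n
```

Because a refill leaves at least half of the queue free, pushing the references found in the retrieved objects does not immediately trigger another overflow.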



4 Parallel Semispaces 

Semispaces (a.k.a. "copying") collection divides 
the heap equally into two regions: from-space and 
to-space. Objects are allocated in from-space until 
it fills up, then a GC is triggered. Reachable ob- 
jects are copied into a contiguous area of to-space, 
leaving the remaining space free for allocation. As 
the GC traces the transitive closure, it copies each 
object when it is first encountered, leaving a for- 
warding pointer in the from-space copy of the object 
to indicate its new address in to-space. Subsequent 
references to this object are updated with the for- 
warding pointer. 

In the elegant style of Cheney [7], a copy pointer 
tracks the next free address, and a scan pointer 
tracks the next object to be scanned. The GC scans 
the object indicated by the scan pointer; it examines 
references in the object, copying any referenced ob- 
ject still in from-space to to-space, updating the copy 
pointer. The scan pointer is then updated to point 
to the next object. Collection is complete when the 
scan pointer reaches the copy pointer, at which point 
we swap to-space and from-space, and resume the 
program. 
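
The sequential algorithm can be sketched as below. This is an illustrative model, not the collector's code: a Python list plays the role of to-space, its length plays the role of the copy pointer, and the names `Node` and `cheney_collect` are hypothetical.

```python
class Node:
    def __init__(self, *fields):
        self.fields = list(fields)   # references to other Nodes (or None)
        self.forward = None          # forwarding pointer, set once the object is copied

def cheney_collect(roots):
    to_space = []                    # len(to_space) acts as the copy pointer

    def copy(obj):
        if obj is None:
            return None
        if obj.forward is None:      # first encounter: copy and leave a forwarding pointer
            clone = Node(*obj.fields)
            obj.forward = clone
            to_space.append(clone)
        return obj.forward

    new_roots = [copy(r) for r in roots]
    scan = 0                         # scan pointer
    while scan < len(to_space):      # done when scan reaches the copy pointer
        obj = to_space[scan]
        obj.fields = [copy(f) for f in obj.fields]
        scan += 1
    return new_roots, to_space
```

The region of to-space between the scan and copy pointers is the set of objects still to be scanned, so no separate queue or stack is needed in the sequential case.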

Our pSemispaces algorithm parallelizes this se- 
quential algorithm. We depend on the infrastruc- 
ture to properly distribute the process of scanning 
the roots. Rather than using Cheney's copy and 
scan pointers to represent the set of objects to be 
scanned, we use explicit work-stealing queues. 

With a parallel copying collector, many threads 
allocate objects in to-space at the same time. One 
approach to managing this concurrency would be for 
each thread to increment the copy pointer atomically 
for each object it copies, using some hardware operation
such as fetch-and-add or compare-and-swap
(CAS) [3]. However, our experiments indicate that 
this results in too much contention. The alternative 
we adopted was to have each thread use such atomic 
allocation only to allocate relatively large regions 
called local allocation buffers (LABs). A thread can 
then do local allocations within this buffer with no 
synchronization. A thread can also deallocate its 
most recent allocation, which is useful in paralleliz- 
ing the insertion of the forwarding pointer, as we ex- 
plain below. LABs should be large enough to reduce 
contention on the copy pointer, yet small enough to 
avoid excessive fragmentation. Note that the po- 
tential fragmentation introduced by LABs makes it 
possible that to-space may not hold all of the objects 
copied from from-space. However, this is a concern 
only when the heap is very nearly full. 
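
A minimal sketch of LAB-based allocation follows. Addresses are plain integers, the lock stands in for the atomic update of the copy pointer, and the names `ToSpace`, `LAB`, `allocate_lab`, and `retract` are hypothetical. The sketch assumes no object is larger than a LAB.

```python
import threading

class ToSpace:
    def __init__(self, size):
        self.size = size
        self.copy_ptr = 0
        self._lock = threading.Lock()        # stands in for fetch-and-add or CAS

    def allocate_lab(self, lab_size):
        """Atomically claim a buffer; returns its start address, or None if out of space."""
        with self._lock:
            if self.copy_ptr + lab_size > self.size:
                return None
            start = self.copy_ptr
            self.copy_ptr += lab_size
        return start

class LAB:
    """Per-thread buffer: allocation inside it needs no synchronization."""

    def __init__(self, space, lab_size):
        self.space, self.lab_size = space, lab_size
        self.top = self.end = 0              # empty until the first refill
        self.last = None                     # start of the most recent allocation

    def allocate(self, nbytes):
        if nbytes > self.lab_size:
            raise ValueError("sketch assumes objects fit in a LAB")
        if self.top + nbytes > self.end:     # buffer exhausted: claim a new one
            start = self.space.allocate_lab(self.lab_size)
            if start is None:
                return None
            self.top, self.end = start, start + self.lab_size
        self.last = self.top
        self.top += nbytes
        return self.last

    def retract(self):
        """Undo the most recent allocation, if there is one."""
        if self.last is not None:
            self.top, self.last = self.last, None
```

The unused tail of a buffer that is abandoned on refill is the fragmentation discussed above.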

Collection must preserve the shape of the object 
graph. If several threads are simultaneously process- 
ing references to the same uncopied object in from- 
space, only one may succeed in copying the object. 
The others must observe that the object has been 
copied and update their references according to the 
forwarding pointer installed by the copying thread. 
We accomplish this by having each thread specula- 
tively allocate space for the object in its LAB, and 
then use a CAS to update the from-space object's
forwarding pointer to point to the speculative new 
address. If the CAS succeeds, the thread proceeds 
to copy the object. If the CAS fails, the CAS returns 
the updated forwarding pointer. 3 The thread uses 
this value to update its reference, and then locally 
retracts its speculative allocation. 
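
The speculative-allocation step can be sketched as follows. Python has no hardware CAS, so `cas_forward` emulates one with a lock and, like the CAS implementations referred to above, returns the value it found. The names `FromSpaceObject`, `cas_forward`, and `copy_or_forward`, and the `allocate` and `retract` callbacks, are hypothetical.

```python
import threading

class FromSpaceObject:
    def __init__(self, payload):
        self.payload = payload
        self.forward = None                  # forwarding pointer; None means "not yet copied"
        self._lock = threading.Lock()        # emulates a hardware CAS on the forwarding word

    def cas_forward(self, expected, new):
        """Emulated CAS: returns the value found in the forwarding word before the attempt."""
        with self._lock:
            old = self.forward
            if old is expected:
                self.forward = new
            return old

def copy_or_forward(obj, allocate, retract):
    """Returns (address of the to-space copy, whether this thread must do the copying)."""
    new_addr = allocate()                    # speculative allocation in the thread's LAB
    seen = obj.cas_forward(None, new_addr)
    if seen is None:                         # CAS succeeded: this thread copies the object
        return new_addr, True
    retract()                                # CAS failed: give the speculative space back
    return seen, False
```

Every caller ends up with the same to-space address, and exactly one caller is told to perform the copy.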

The semispaces algorithm is often used in 
youngest generations of generational collectors [7, 
13]. A generational collector has two or more gen- 
erations; objects are usually allocated in younger, 
smaller generations, and promoted to older gener- 
ations if they survive long enough. The hope is 
that youngest-generation collections are significantly 
faster than collections of the entire heap, and likely 
to reclaim sufficient space to continue computa- 
tion. However, multi-threaded programs running 
on multiprocessors will have larger aggregate alloca- 
tion rates than single-threaded programs, and will 
therefore fill a young generation of a given size more 
quickly, increasing collection frequency. It is there- 
fore attractive to increase the size of the youngest 
generation to reduce collection frequency with multi- 
threaded programs, and to use parallelism to keep 
pause times low and throughput high. 

Two further issues must be addressed when us- 
ing the pSemispaces algorithm in the youngest gen- 
eration of a generational collector. First, collector 
threads will allocate both in to-space and in the 
older generation (for promotion). Both forms of al- 
location must be parallelized; old-generation promo- 
tion therefore uses the same LAB-based allocation 
technique as to-space allocation. Second, when per- 
forming a youngest-generation collection, we treat 
all older generation objects as roots. We cannot 
traverse the entire heap to find youngest-generation 
references, or else youngest-generation collection will 
be as costly as collection of the entire heap. There- 
fore, generational systems, including ours, often keep 
track of such old-to-young references using a card 
table, an array whose entries correspond to subdivi- 
sions of the heap called cards. When mutator code 
updates a reference field, it also "dirties" the corre- 
sponding card table entry. The youngest-generation 
collector scans the card table to find these dirty en- 
tries, which are the ones whose corresponding cards 
might contain old-to-young references. 4 
3 In CAS implementations of which we are aware.
4 When a card contains old-to-young references after a col- 
lection, the collector leaves the corresponding card table entry 



Figure 2: Parallel Compaction 

For large heaps, scanning the card table may take 
a long time and therefore should be partitioned 
across threads. At first we partitioned this work in 
the most straightforward way: dividing the card ta- 
ble into consecutive contiguous blocks, which were 
claimed by the GC threads. Unfortunately this 
didn't work well on some applications, because some 
blocks were very dense, while others were sparse; 
for example, large arrays of references caused dense 
blocks. Scanning the dense blocks was dominating 
the cost of the GC. To address this problem we in- 
stead overpartitioned the card table into N strides, 
each of which is a set of cards separated by intervals 
of N cards. Thus cards {0, N, 2N, ...} comprise one
stride, cards {1, N+1, 2N+1, ...} comprise the next,
and so on. This causes dense areas to be partitioned 
across tasks. As usual, threads compete to claim 
strides. 
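
The stride partitioning can be illustrated with a short sketch. The card table is modelled as a list of booleans (True meaning dirty), and the function names are hypothetical.

```python
def stride(card_count, num_strides, k):
    """Indices of the cards in stride k: k, k + N, k + 2N, ... where N = num_strides."""
    return range(k, card_count, num_strides)

def dirty_cards_in_stride(card_table, num_strides, k):
    """Indices of the dirty cards that the thread claiming stride k must scan."""
    return [i for i in stride(len(card_table), num_strides, k) if card_table[i]]
```

For a 32-card table whose first 8 cards are dirty, a split into 4 contiguous blocks would put all 8 dirty cards in the first block, whereas 4 strides receive 2 dirty cards each.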



5 Parallel markcompact 

Our old generation uses a markcompact collector. 
The original sequential markcompact collector con- 
sists of four major phases: 

• The marking phase, which identifies and marks 
live objects. 

• The forwarding-pointer installation phase, 
which computes the new addresses live objects 
will have after compaction and stores these ad- 
dresses as forwarding pointers in the objects' 
headers. 

• The reference redirection phase, which updates 
references in live objects to the new addresses 
of the objects they reference. 

• The compaction phase, which copies live objects 
to their new compacted addresses. 

Our pMarkcompact algorithm parallelizes this 
single-threaded algorithm, by parallelizing each of 
the phases. The parallelization of the first three 

dirty. 



phases is relatively straightforward, but the final 
compaction phase presented difficulties. The orig- 
inal sequential compaction phase compacted all live 
data to the low end of the heap. In the parallel case, 
it was difficult to ensure that one thread did not 
overwrite object data that another thread had yet 
to copy. Our solution to this problem is to break 
the heap into n regions, where n is the number of 
GC threads. Each thread claims a region and slides 
live objects in its own region only. (Section 5.2 dis- 
cusses the criteria that influence the selection of re- 
gion boundaries.) The direction to which objects 
are moved alternates for odd and even numbered re- 
gions. Figure 2 shows an example of a heap with 4 
regions and 2 free areas after compaction. In general,
a heap with n regions has ⌊(n+1)/2⌋ contiguous free
areas. For practical purposes, a small number of suf- 
ficiently large contiguous free areas allows allocation 
as efficiently as a single free area. 
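
The effect of alternating directions on the free areas can be checked with a small sketch. It assumes that the first region compacts toward low addresses, which is the choice that yields 2 free areas for 4 regions; the function name `free_areas` and the tuple layout are hypothetical.

```python
def free_areas(regions):
    """regions: list of (start, end, live_bytes), contiguous and in address order.

    Even-indexed regions compact toward low addresses, odd-indexed ones toward
    high addresses. Returns the merged list of free (start, end) intervals.
    """
    merged = []
    for i, (start, end, live) in enumerate(regions):
        free = (start + live, end) if i % 2 == 0 else (start, end - live)
        if free[0] == free[1]:
            continue                                   # region is completely full
        if merged and merged[-1][1] == free[0]:
            merged[-1] = (merged[-1][0], free[1])      # adjoins the previous free area
        else:
            merged.append(free)
    return merged
```

With four 100-byte regions holding 40 live bytes each, the free space of regions 0 and 1 merges into one area and that of regions 2 and 3 into another. When every region keeps some free space, the number of areas for n regions comes out to (n + 1) // 2.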

The following subsections describe each parallel
phase in detail. 

5.1 Parallel marking 

Similar to pSemispaces, the parallel marking 
phase employs the parallel GC infrastructure to stat- 
ically partition the root set and to dynamically bal- 
ance further marking work through work stealing. 
Each thread keeps a work queue of objects to be 
scanned for pointers to other objects. When a thread 
runs out of objects, it attempts to steal an object 
from the work queue of another thread. Unlike 
pSemispaces, which requires synchronization on the 
installation of forwarding pointers, marking is idem- 
potent and therefore requires no synchronization. 5 

5.2 Parallel forwarding-pointer installa- 
tion 

At this point, all live objects have been marked. 
The next phase corresponds to the "sweep" phase of 
a mark-sweep collector, and also has the side-effect 
of computing the distribution of live data, which will 
guide the partitioning of the heap into the regions 
discussed above. First, we overpartition the heap 
into m units of (roughly) equal size. (We ensure 
that unit boundaries are object-aligned, which leads 
to the approximation above.) The value of m is typ- 
ically 4n, where n is the number of GC threads. The 
GC threads compete to claim units; for each unit, 

5 Note that this lack of synchronization also depends on 
having the mark bits present in the object; if an external 
marking array were used then one word might contain several 
marks, which would necessitate synchronization. 



the thread traverses the objects, counting the num- 
ber of bytes of live data in the unit, and coalescing 
contiguous regions of dead objects into single blocks 
traversable in constant time. 

When all units are processed, we know the ex- 
act amount of live data in each unit, and can parti- 
tion the heap into regions with approximately equal 
amounts of live data. The partition is such that each 
region contains one or more of the units created in 
the previous pass, i.e. regions are unit-aligned. Re- 
gions are the partitions used to solve the compaction 
problem; they are the heap divisions in Figure 2. 
The region that contains an object dictates the di- 
rection in which it will be copied. Since we know 
how much live data is in each unit in a region, it is 
straightforward to calculate the new address of the 
first live object in a particular unit, by summing the 
live data of the previous units in the region (in the 
appropriate compaction order for the region). Thus, 
forwarding pointer installation can use the unit par- 
titioning already established. GC threads dynami- 
cally claim units and install forwarding pointers in 
all live objects within the unit. 
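
The two computations described above can be sketched as follows. The greedy cut rule in `partition_into_regions` is an assumption (the text only requires regions to be unit-aligned with approximately equal live data), and `unit_destinations` covers only a region that compacts toward low addresses; the other direction is the mirror image. Both function names are hypothetical.

```python
def partition_into_regions(unit_live, num_regions):
    """Group consecutive units into at most num_regions regions of similar live size.

    unit_live[u] is the number of live bytes counted in unit u.
    Returns a list of regions, each a list of unit indices.
    """
    total = sum(unit_live)
    regions, current, seen = [], [], 0
    for u, live in enumerate(unit_live):
        current.append(u)
        seen += live
        # Cut once the running total reaches the next multiple of total / num_regions.
        if len(regions) < num_regions - 1 and seen * num_regions >= total * (len(regions) + 1):
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions

def unit_destinations(region_units, unit_live, region_start):
    """New address of the first live object in each unit of a low-compacting region."""
    dest, addr = {}, region_start
    for u in region_units:
        dest[u] = addr                # sum of live data in the preceding units
        addr += unit_live[u]
    return dest
```

Once each unit knows the destination of its first live object, threads can install forwarding pointers unit by unit without consulting one another.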

5.3 Parallel reference redirection 

Redirecting object references requires scanning 
roots, objects in the current generation, and ob- 
jects in other generations for references to objects 
in the current generation. The forwarding pointers 
inserted by the previous phase are used to update 
these references. We rely on the parallel GC infras- 
tructure to balance the work of scanning the roots. 
Currently, the scanning of the young generation is 
treated as a single task; in the future, this might be 
further partitioned. Within the old generation we 
reuse the previous unit partitioning. 

5.4 Parallel compaction 

The last phase is parallel compaction. As dis- 
cussed previously, we use the larger-grained region 
partitioning in this phase. There is a trade-off here 
between parallelism, which favors more, smaller, par- 
titions, and allocation efficiency, which favors fewer, 
larger partitions (and thus, fewer, larger free areas 
at the end of compaction.) We currently favor al- 
location efficiency, by making the region partition 
an exact partition (as opposed to an overpartition.) 
This design choice will be investigated further in the 
future. 



6 Results 

6.1 Benchmarks 

We present results for three benchmarks. GCOld 
is a synthetic program which can be used to present 
a variety of loads to a garbage collector, includ- 
ing large heaps requiring significant old-generation 
collections. SpecJBB is a scalability benchmark 
inspired by TPC-C which emulates a 3-tier sys- 
tem with emphasis on the middle tier. Javac is a 
compiler that translates Java programming language 
source code to Java class files. 

The GCOld application allocates an array, each
element of which points to the root of a binary tree
about a megabyte in size. An initial phase allocates
these data structures; then the program does some
number of steps, maintaining a steady-state heap
size. Each step allocates some number of bytes of
short-lived data that will die in a young-generation
collection, and some number of bytes of nodes in
a long-lived tree structure that replaces some
previously existing tree, making it garbage. Each step
further simulates some amount of mutator computation
by several iterations of a busy-work loop.
Finally, since pointer-mutation rate can be an
important factor in the performance of generational
collection, each step modifies some number of pointers
(in a manner that preserves the amount of reachable
data). Command-line parameters control the
amount of live data in the steady state, the number
of steps in the run, the number of bytes of
short-lived and long-lived data allocated in each step, the
amount of simulated work per step, and the number
of pointers modified in a step. We ran GCOld
with 300MB of live data, allocating three bytes of
short-lived data for every byte of long-lived data.

SpecJBB is a "throughput-based" benchmark: it
measures the amount of work accomplished in a fixed
amount of time, rather than the amount of time
required to accomplish a fixed amount of work. To
create runs that can be compared to determine
parallel speedup for GC, we ran with a fixed number (8)
of "warehouses" (i.e., mutator threads), and
considered only the first 500 collections of each run. We
believe the mutator behavior between these collections
is sufficiently similar to make these runs
comparable.

Each graph is annotated with heap configuration 
parameters used for the runs. A heap configuration 
specifies the sizes of the young and old generations 
(which are fixed in all our experiments.) For exam- 
ple, 16m : 600m indicates a young generation of 16 MB 
and an old generation size of 600 MB. The number 
of young- and old-generation collections is similar 

across all runs, including the sequential run, since 
allocation behavior is largely unaffected by collec- 
tion algorithm. (We discuss an exception below.) 

The runs were performed on a Sun Enterprise™
3500 server, with eight 336 MHz UltraSPARC™ processors
sharing 2 Gbyte of memory. The collector
we ran was a generational collector with a parallel 
semispaces young generation and a parallel mark- 
compact old generation. 

6.2 Scalability 

Figure 3 presents our results in terms of scalability 
graphs. The x-axis is the number of processors. The 
y-axis shows speedup relative to the performance of 
the parallel collector run on one processor. We also 
show the curve for linear speedup and the perfor- 
mance of the sequential form of each GC algorithm. 
Speedups for the young generation and old gener- 
ation are shown on separate graphs; speedups are 
calculated on the basis of total time for collections 
of the given type. 

Table 1 gives the average and total GC times for 
the sequential runs and the parallel runs with one 
and eight processors. 
Table 1: Average and total collection times



6.3 Discussion 

We outperform the sequential algorithm using 
only two processors in most cases. The only case 
where we required 3 processors was in pMarkcom- 
pact for SpecJBB. Our hypothesis is that this 
is due to an optimization present in the sequen- 
tial markcompact collector which we have not yet 

adapted to the parallel version. This dense pre- 
fix optimization avoids copying large blocks of data 
when there is only a small amount of free area to 
be reclaimed. In many applications this optimiza- 
tion eliminates a significant fraction of markcompact 
copying costs. We hope to adapt this technique to 
realize similar savings in the parallel version. 

We achieve speedup factors on 8 processors of 
between 4 and 5.5, with the exception of the old- 
generation collections of Javac. One reason for this 
is that there were 6 old-generation collections
in the 8-processor run, but only 4 in the 1-processor 
and sequential runs. We believe that this increase 
is caused by fragmentation introduced by parallel 
LAB allocation during young-generation collection, 
and thus is an inherent cost of parallel collection. 
Note, however, that Javac has by far the smallest 
heaps of the benchmark runs. In larger problem sizes 
this effect is much less significant. 

In the parallel mark-compact collector, we can 
measure the scalability of the individual phases sepa- 
rately. It turns out that all phases scale about as well 
as overall collection. For example, in SpecJBB, the 
overall 8-processor old-generation speedup is 4.351, 
and the speedups of the individual phases range from 
3.7, for installing forwarding pointers and redirect- 
ing references, to 5.0 for sweeping. So no particular 
phase stands out as a clear scalability bottleneck. 
Still, clearly further work is needed to attempt to 
increase scalability (or explain the factors that in- 
hibit it). 



7 Conclusions 

After exploring parallel techniques, and imple- 
menting two parallel collectors, we believe that there 
is great potential for improving both pause times and 
throughput using parallelism. 

Large multi-threaded applications are being writ- 
ten in garbage-collected languages. These applica- 
tions require heaps in the gigabyte range and be- 
yond. Sequential GC algorithms will become an 
ever-greater scaling bottleneck. If systems intended
to support such applications stop all threads for 
garbage collection, they must use parallel techniques 
to avoid this bottleneck. 



8 Trademarks 

Sun, Sun Microsystems, Sun Enterprise, JVM, 
and Java are trademarks or registered trademarks
of Sun Microsystems, Inc. in the United States 
and other countries. All SPARC trademarks are 
used under license and are trademarks or registered 
trademarks of SPARC International, Inc. in the 
United States and other countries. Products bearing 
SPARC trademarks are based upon an architecture 
developed by Sun Microsystems, Inc. 